Appendix A: Variational Paragraph Embedder

A.1 Selection of substitution rate p

[Figure 4: Impact of the proportion of injected noise for learning Paragraph Embeddings on the XSum dataset.]

The results of the ablation study are presented in Table 5. They demonstrate the ability of the Variational Paragraph Embedder to provide clean and denoised reconstructions. In general, generation progresses in a coarse-to-fine manner: at early time steps (close to 1), outputs tend to be less fluent and more generic. Example reconstruction: "This was the nicest stay we have ever had. Turtle Bay was a great resort. This was the nicest stay we have ever had."
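For concreteness, here is a minimal sketch of substitution-style noise injection at rate p, the quantity this ablation varies; the function, the token-id representation, and the uniform replacement distribution are illustrative assumptions, not the paper's implementation.

```python
import random

def substitute_tokens(token_ids, vocab_size, p, seed=None):
    """Replace each token id with a uniformly random vocabulary id
    with probability p; otherwise keep the original token.
    Generic substitution noise; the paper's exact scheme may differ."""
    rng = random.Random(seed)
    return [rng.randrange(vocab_size) if rng.random() < p else t
            for t in token_ids]

# Example: corrupt roughly 30% of tokens in a toy sequence.
noisy = substitute_tokens([5, 17, 42, 8, 99], vocab_size=30000, p=0.3, seed=0)
print(noisy)
```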
- North America > United States > California > Los Angeles County > Los Angeles (0.14)
- Oceania > Australia (0.04)
- North America > United States > Virginia (0.04)
- (12 more...)
- North America > United States > Massachusetts > Suffolk County > Boston (0.04)
- North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Europe > Denmark (0.04)
Notes

$^1$ A special event $x_0$ is sometimes given at time 0 to mark the beginning of the sequence; the model then generates the rest of the sequence conditioned on $x_0$.
NHP is a thoughtfully designed framework that has been demonstrated effective on temporal data, but our method can also be used for other models with parametric intensity functions.

In this section, we prove the claim in section 2.2 that $\arg\max_\theta J_{\text{LL}}(\theta) = \Theta^*$. When we take the expectation under $p$, each summand is weighted by the probability that $x_{[0,t)}$ and $x_{[t,t+\mathrm{d}t)}$ take on the values in that summand. Therefore, we have $G_\theta(t, x_{[0,t)}) < 0$, since the distributions in equation (9) are distinct for the given history $x_{[0,t)}$. This lemma says: if $\theta$ and $\theta^*$ are meaningfully different, in the sense that they predict different intensities at time $t$ for some history, then they do so on a set of histories of non-zero measure, making this difference visible in objective functions such as $J_{\text{LL}}(\theta)$ (see above) and $J_{\text{NC}}(\theta)$ (see Appendix B). We use $d$ to denote the maximal difference between the intensities over $(t', t'')$, i.e., $d = \sup_{t \in (t', t'')} \left| \lambda_\theta(t \mid x_{[0,t)}) - \lambda_{\theta^*}(t \mid x_{[0,t)}) \right|$. If $x_{[0,t)}$ contains no event, then its probability is $p(x_{[0,t)}) = \exp\!\big(-\int_0^t \lambda(s \mid x_{[0,s)})\,\mathrm{d}s\big)$. Suppose that $t_1$ has been shifted by $R$. Recall that we need order-$(1/\mathrm{d}t)^I$ many such histories.
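For orientation, the log-likelihood objective above instantiates the standard point-process likelihood: for an event sequence $x = (t_1, \dots, t_I)$ observed on $[0, T)$ with conditional intensity $\lambda_\theta$, a standard identity (not specific to NHP) gives

$$
\log p_\theta(x) \;=\; \sum_{i=1}^{I} \log \lambda_\theta\big(t_i \mid x_{[0,t_i)}\big) \;-\; \int_0^T \lambda_\theta\big(s \mid x_{[0,s)}\big)\,\mathrm{d}s .
$$

Setting $I = 0$ recovers the no-event probability $\exp\big(-\int_0^t \lambda(s \mid x_{[0,s)})\,\mathrm{d}s\big)$ used in the proof.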
Feature Learning for Interpretable, Performant Decision Trees
Supplementary Material

1 Experiment Specification
Here we give the full specification of the experiments, including details omitted from the main text. If a dataset had separate training and test sets, they were combined before creating the random 10-fold split. All attributes are normalized to mean 0 and standard deviation 1. Additional details for each model type follow; a sketch of the shared protocol appears below.
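As one concrete reading of this shared protocol, the following scikit-learn sketch combines the original splits, builds a random 10-fold split, and standardizes attributes; the synthetic data, the random seeds, and the choice to fit the scaler on each training fold are our assumptions, not the paper's specification.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Stand-ins for a dataset's original train/test splits.
X_train, y_train = rng.normal(size=(80, 4)), rng.integers(0, 2, 80)
X_test, y_test = rng.normal(size=(20, 4)), rng.integers(0, 2, 20)

# Combine the original splits before creating the random 10-fold split.
X = np.vstack([X_train, X_test])
y = np.concatenate([y_train, y_test])

kf = KFold(n_splits=10, shuffle=True, random_state=0)
for tr, te in kf.split(X):
    # Normalize attributes to mean 0, std 1; fitting the scaler on the
    # training fold only is an assumption the text does not spell out.
    scaler = StandardScaler().fit(X[tr])
    X_tr, X_te = scaler.transform(X[tr]), scaler.transform(X[te])
    # ...train a model on (X_tr, y[tr]) and evaluate on (X_te, y[te])...
```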
Cache-to-Cache: Direct Semantic Communication Between Large Language Models
Fu, Tianyu, Min, Zihan, Zhang, Hanling, Yan, Jichao, Dai, Guohao, Ouyang, Wanli, Wang, Yu
Multi-LLM systems harness the complementary strengths of diverse Large Language Models, achieving performance and efficiency gains unattainable by a single model. In existing designs, LLMs communicate through text, forcing internal representations to be transformed into output token sequences. This process both loses rich semantic information and incurs token-by-token generation latency. Motivated by these limitations, we ask: Can LLMs communicate beyond text? Oracle experiments show that enriching the KV-cache semantics can improve response quality without increasing cache size, supporting the KV-cache as an effective medium for inter-model communication. Thus, we propose Cache-to-Cache (C2C), a new paradigm for direct semantic communication between LLMs. C2C uses a neural network to project and fuse the source model's KV-cache with that of the target model to enable direct semantic transfer. A learnable gating mechanism selects the target layers that benefit from cache communication. Compared with text communication, C2C utilizes the deep, specialized semantics from both models, while avoiding explicit intermediate text generation. Experiments show that C2C achieves 8.5-10.5% higher average accuracy than individual models. It further outperforms the text communication paradigm by approximately 3.0-5.0%, while delivering an average 2.0x speedup in latency. Our code is available at https://github.com/thu-nics/C2C.
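The abstract's description suggests roughly the following structure: a learned projection maps the source model's per-layer KV-cache into the target's cache space, and a learnable per-layer gate decides how much fused cache each target layer accepts. This PyTorch sketch is our illustrative reading of that description, not the released implementation (see the linked repository for the actual code); the module name, dimensions, and the additive fusion rule are assumptions.

```python
import torch
import torch.nn as nn

class CacheFuser(nn.Module):
    """Illustrative sketch of C2C-style cache fusion: project the source
    model's per-layer KV-cache into the target's space and blend it in
    through a learnable per-layer gate."""

    def __init__(self, src_dim: int, tgt_dim: int, n_layers: int):
        super().__init__()
        self.proj_k = nn.Linear(src_dim, tgt_dim)
        self.proj_v = nn.Linear(src_dim, tgt_dim)
        # One learnable gate logit per target layer; sigmoid keeps it in (0, 1).
        self.gate_logits = nn.Parameter(torch.zeros(n_layers))

    def forward(self, src_kv, tgt_kv):
        # src_kv / tgt_kv: lists of (K, V) pairs, one per layer, each tensor
        # of shape (batch, seq, dim); source and target sequences are assumed
        # to be aligned in length here.
        fused = []
        for layer, ((sk, sv), (tk, tv)) in enumerate(zip(src_kv, tgt_kv)):
            g = torch.sigmoid(self.gate_logits[layer])
            fused.append((tk + g * self.proj_k(sk), tv + g * self.proj_v(sv)))
        return fused

# Toy usage with hypothetical sizes: 2 layers, batch 1, seq 4, dims 8 -> 16.
fuser = CacheFuser(src_dim=8, tgt_dim=16, n_layers=2)
src = [(torch.randn(1, 4, 8), torch.randn(1, 4, 8)) for _ in range(2)]
tgt = [(torch.randn(1, 4, 16), torch.randn(1, 4, 16)) for _ in range(2)]
out = fuser(src, tgt)
```

A soft sigmoid gate is used here for differentiability; the paper's gating may instead make a harder per-layer selection.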
- Asia > Middle East > Jordan (0.04)
- North America > United States > Wisconsin (0.04)
- North America > United States > Massachusetts > Suffolk County > Boston (0.04)
- (3 more...)
Vecchia-Inducing-Points Full-Scale Approximations for Gaussian Processes
Gyger, Tim, Furrer, Reinhard, Sigrist, Fabio
Gaussian processes are flexible, probabilistic, non-parametric models widely used in machine learning and statistics. However, their scalability to large data sets is limited by computational constraints. To overcome these challenges, we propose Vecchia-inducing-points full-scale (VIF) approximations combining the strengths of global inducing points and local Vecchia approximations. Vecchia approximations excel in settings with low-dimensional inputs and moderately smooth covariance functions, while inducing point methods are better suited to high-dimensional inputs and smoother covariance functions. Our VIF approach bridges these two regimes by using an efficient correlation-based neighbor-finding strategy for the Vecchia approximation of the residual process, implemented via a modified cover tree algorithm. We further extend our framework to non-Gaussian likelihoods by introducing iterative methods that reduce computational costs for training and prediction by several orders of magnitude compared to Cholesky-based computations when using a Laplace approximation. In particular, we propose and compare novel preconditioners and provide theoretical convergence results. Extensive numerical experiments on simulated and real-world data sets show that VIF approximations are computationally efficient as well as more accurate and numerically stable than state-of-the-art alternatives. All methods are implemented in the open source C++ library GPBoost with high-level Python and R interfaces.
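Read literally, this family of full-scale approximations decomposes the covariance into a global low-rank inducing-point part plus a locally approximated residual. In symbols (our paraphrase of the standard full-scale construction, with $K_{mm}$, $K_{nm}$ the inducing-point covariance blocks and $\tilde{\Sigma}$ the Vecchia approximation of the residual):

$$
\Sigma \;\approx\; \underbrace{K_{nm} K_{mm}^{-1} K_{mn}}_{\text{global: inducing points}} \;+\; \underbrace{\tilde{\Sigma}}_{\text{local: Vecchia approx. of }\, \Sigma - K_{nm} K_{mm}^{-1} K_{mn}}
$$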
- Europe > Switzerland > Zürich > Zürich (0.04)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- North America > United States > Georgia > Fulton County > Atlanta (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Unlearning Works Better Than You Think: Local Reinforcement-Based Selection of Auxiliary Objectives
Bendahi, Abderrahim, Fradin, Adrien, Lerasle, Matthieu
We introduce Local Reinforcement-Based Selection of Auxiliary Objectives (LRSAO), a novel approach that selects auxiliary objectives using reinforcement learning (RL) to support the optimization process of an evolutionary algorithm (EA), as in the EA+RL framework, and furthermore incorporates the ability to unlearn previously used objectives. By modifying the reward mechanism to penalize moves that do not increase the fitness value and by relying on local auxiliary objectives, LRSAO dynamically adapts its selection strategy to the landscape and unlearns previous objectives when necessary. We analyze and evaluate LRSAO on the black-box complexity version of the non-monotonic Jump function with gap parameter $\ell$, where each auxiliary objective is beneficial at specific stages of optimization. The Jump function is hard to optimize for evolutionary algorithms, and the best previously known complexity for reinforcement-based selection on Jump was $O(n^2 \log(n) / \ell)$. Our approach improves on this result, achieving a complexity of $\Theta(n^2 / \ell^2 + n \log(n))$, a significant improvement that demonstrates the efficiency and adaptability of LRSAO and highlights its potential to outperform traditional methods in complex optimization scenarios.
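As a rough illustration of the reward mechanism just described (penalize moves that do not increase fitness; reset, i.e. "unlearn", an objective's accumulated credit), consider this minimal sketch; the reward values, the greedy selection rule, and the reset threshold are our assumptions, not the LRSAO specification.

```python
import random

def select_objective(q):
    """Greedily pick the auxiliary objective with the highest value
    estimate, breaking ties uniformly at random."""
    best = max(q.values())
    return random.choice([k for k, v in q.items() if v == best])

def lrsao_step(q, fitness_before, fitness_after, chosen, reset_threshold=-5.0):
    """Reward +1 if the move increased fitness, penalty -1 otherwise
    (assumed values). If an objective's value falls below the threshold,
    reset it to zero -- 'unlearning' its accumulated credit."""
    q[chosen] += 1.0 if fitness_after > fitness_before else -1.0
    if q[chosen] < reset_threshold:
        q[chosen] = 0.0
    return q

# Toy usage: two hypothetical auxiliary objectives.
q = {"obj_A": 0.0, "obj_B": 0.0}
chosen = select_objective(q)
q = lrsao_step(q, fitness_before=3, fitness_after=4, chosen=chosen)
```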
- Europe > Spain > Andalusia > Málaga Province > Málaga (0.05)
- North America > United States > New York > New York County > New York City (0.04)
- Europe > France (0.04)
- (4 more...)